Statistics 2: Probabilities, Distributions, and Tests

Notebook Summary

This notebook is a series of exercises for practicing probabilities, distributions, and statistical tests by answering questions about Titanic passenger data. Each section has a header stating the question, followed by the calculation and a written summary of the result.

Importing data file


In [1]:
import numpy as np
import pandas as pd

titanic_data = pd.read_csv('train.csv')
titanic_data.head(5)


Out[1]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Cleaning and filling data

Checking to see what columns need to be filled using the .info() method.


In [2]:
titanic_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Filling all NaN ages with the mean of the known ages and confirming with the .info() method. Later steps compensate for this by filtering out the ages equal to that mean. Each such step is marked with 'COMPENSATION:' and explained where it occurs.


In [3]:
titanic_data.Age = titanic_data.Age.fillna(np.mean(titanic_data.Age))

In [4]:
titanic_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
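
An alternative to filtering out the filled values later is to record which rows were missing before filling them. The following is a minimal sketch on made-up data, not the notebook's own method; the names `age_was_missing` and `observed_ages` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Age column (values are made up for illustration).
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 4.0]})

# Record which rows are missing BEFORE filling them.
age_was_missing = df["Age"].isna()

# Fill the gaps with the mean of the observed ages (NaN is skipped by default).
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Recover the originally observed ages without comparing rounded floats.
observed_ages = df.loc[~age_was_missing, "Age"]
print(observed_ages.tolist())
```

Keeping the mask avoids any risk of dropping a genuine age that happens to equal the rounded mean.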

Question 1

Calculate the probability of surviving the sinking of the Titanic.


In [5]:
survivors = titanic_data[titanic_data.Survived == 1]
survivor_prob = len(survivors) / len(titanic_data)
# survivor_prob is a proportion between 0 and 1, not a percentage
print("The probability of survival is " + str(survivor_prob) + ".")


The probability of survival is 0.3838383838383838.
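
A proportion and a percentage differ by a factor of 100, so it helps to format the number explicitly as one or the other. A small sketch using made-up survival flags (not the real data) and standard Python format specifiers:

```python
# Synthetic survival flags (1 = survived, 0 = did not); values are made up.
survived = [0, 1, 1, 0, 0, 1, 0, 0]

# A probability is a fraction between 0 and 1.
prob = sum(survived) / len(survived)

# Format explicitly so the label matches the number.
as_probability = f"{prob:.3f}"  # fixed-point fraction
as_percent = f"{prob:.1%}"      # the % specifier multiplies by 100
print(as_probability, as_percent)
```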

Question 2

CHOOSE TWO OF...

  • A passenger was male
  • A passenger was female and had at least 1 SibSp on board
  • A survivor was from Cherbourg

Our first choice is the probability that a passenger was male.


In [6]:
male_passenger = titanic_data[titanic_data.Sex == 'male']
prob_male = len(male_passenger) / len(titanic_data)
print("The probability that a passenger was male is " + str(prob_male) + ".")


The probability that a passenger was male is 0.6475869809203143.

Our second choice is the probability that a survivor was from Cherbourg (embarked at port 'C').


In [7]:
c_port = survivors[survivors.Embarked == 'C']
# Conditional probability: the denominator is survivors only, not all passengers
prob_c = len(c_port) / len(survivors)
print("The probability that a survivor was from Cherbourg is " + str(prob_c) + ".")


The probability that a survivor was from Cherbourg is 0.2719298245614035.
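
The Cherbourg question is a conditional probability, and the direction of the condition matters: P(Cherbourg | survived) is not the same as P(survived | Cherbourg). The sketch below illustrates this on a small made-up table that reuses the notebook's column names; the values are synthetic and the variable names are hypothetical.

```python
import pandas as pd

# Synthetic passengers (made-up values, same column names as the notebook).
df = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
    "Embarked": ["C", "S", "C", "C", "S", "Q", "S", "S"],
})

survived = df["Survived"] == 1
from_c = df["Embarked"] == "C"

# P(C | survived): restrict to survivors, then take the share from Cherbourg.
p_c_given_survived = from_c[survived].mean()

# P(survived | C): restrict to Cherbourg passengers instead.
p_survived_given_c = survived[from_c].mean()

# Same conditional probability via the definition P(A and B) / P(B).
p_joint = (survived & from_c).mean()
p_check = p_joint / survived.mean()

print(p_c_given_survived, p_survived_given_c, p_check)
```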

Question 3

Plot the distribution of passenger ages. (Bins = 25)

COMPENSATION: Earlier, all missing ages were replaced with the mean age. Because those filled values would produce an artificial spike at the center of the histogram, we compare each passenger's age, rounded to three decimal places, against the mean rounded to three decimal places, and drop the matching values. With the filled values removed, the histogram reflects only the recorded ages. (This is what the for loop below does.)

In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
all_ages = []
age_mean = np.mean(titanic_data.Age)

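# Keep only ages that differ from the imputed mean (see COMPENSATION above)
for age in titanic_data.Age:
    if round(age, 3) != round(age_mean, 3):
        all_ages.append(age)

H, edges = np.histogram(all_ages, bins=25)

ax = plt.subplot(111)
# align='edge' starts each bar at its bin's left edge; the default would center it there
ax.bar(edges[:-1], H / float(sum(H)), width=edges[1] - edges[0], align='edge')
ax.set_xlabel("Passenger Age")
ax.set_ylabel("Fraction of Passengers")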
ax.minorticks_on()
plt.show()
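
The plot above divides the bin counts by their total, so the bar heights are fractions that sum to 1. NumPy's `density=True` option uses a different normalization, where the bar areas (height times bin width) sum to 1. A short sketch on made-up ages, for illustration only:

```python
import numpy as np

# Synthetic ages (made-up values for illustration only).
ages = np.array([2, 5, 18, 22, 24, 29, 30, 35, 41, 58, 63, 70], dtype=float)

counts, edges = np.histogram(ages, bins=5)
widths = np.diff(edges)

# Fraction of observations per bin: the heights sum to 1.
fractions = counts / counts.sum()

# Probability density: heights times bin widths sum to 1.
density, _ = np.histogram(ages, bins=5, density=True)

print(fractions.sum(), (density * widths).sum())
```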


Question 4

Find the probability that a passenger was less than 10 years old.


In [9]:
less_than_ten = []
for age in all_ages:
    if age < 10:
        less_than_ten.append(age)

prob_less_than_ten = len(less_than_ten) / len(all_ages)
print("There is a " + str(round(prob_less_than_ten, 3)) + " probability that a passenger was less than 10 years old.")


There is a 0.087 probability that a passenger was less than 10 years old.

Question 5

Given 100 passengers at random, determine the probability that exactly 42 passengers survive.


In [10]:
from scipy.stats import binom
binom.pmf(42, 100, survivor_prob)


Out[10]:
0.061330411815167886

There is a 0.0613 probability that exactly 42 passengers survive out of 100. See the 'Out' value above for a more precise probability.
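
For reference, the value returned by binom.pmf corresponds to the binomial probability mass function, written here with n = 100 passengers, k = 42 survivors, and p equal to survivor_prob. The model assumes each of the 100 passengers survives independently with the same probability p, which is a simplification.

```latex
P(X = k) = \binom{n}{k} \, p^{k} \, (1 - p)^{n - k},
\qquad n = 100,\quad k = 42,\quad p \approx 0.3838
```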

Question 5 (continued)

What is the probability that at least 42 of those 100 passengers survive?


In [11]:
# "At least 42" includes 42 itself, so subtract P(X <= 41)
round(1 - binom.cdf(41, 100, survivor_prob), 4)


Out[11]:
0.2594

There is a 0.2594 probability that at least 42 of those 100 passengers survive.
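
For a discrete distribution, "at least 42" (X >= 42) and "more than 42" (X > 42) are different events, and they differ by exactly P(X = 42). The sketch below checks this with only the standard library; `binom_pmf` is a hypothetical helper written for this illustration, and p is copied from the survival probability printed in Question 1.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 100
p = 0.3838383838383838  # survival probability printed in Question 1

# Sum the mass function over the two different tails.
p_at_least_42 = sum(binom_pmf(k, n, p) for k in range(42, n + 1))   # P(X >= 42)
p_more_than_42 = sum(binom_pmf(k, n, p) for k in range(43, n + 1))  # P(X > 42)

# The two tails differ by the probability of exactly 42.
gap = p_at_least_42 - p_more_than_42
print(round(p_at_least_42, 4), round(p_more_than_42, 4), round(gap, 4))
```

In SciPy terms, P(X >= 42) corresponds to 1 - binom.cdf(41, n, p), while 1 - binom.cdf(42, n, p) gives P(X > 42).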

Question 6

Is there a statistically significant difference between the age of male and female survivors?

COMPENSATION: Within the male and female survivor groups we also drop any age that matches the mean to the third decimal place, which removes the filled-in values. This compensation increased the p-value. A larger p-value does not by itself show that the compensation is correct; it only shows that the filled-in ages had been affecting the test.


In [12]:
from scipy.stats import ttest_ind

survivors_male = survivors[(survivors.Sex == 'male') & (round(survivors.Age,3) != round(age_mean, 3)) ]
survivors_female = survivors[(survivors.Sex == 'female') & (round(survivors.Age, 3) != round(age_mean, 3))]
t_stat, p_value = ttest_ind(survivors_male.Age, survivors_female.Age)

print("Results:\n\tt-statistic: %.5f\n\tp-value: %.5f" % (t_stat, p_value))


Results:
	t-statistic: -0.83512
	p-value: 0.40434

There is no statistically significant difference between the ages of female and male survivors, because the p-value (0.404) is greater than 0.05.
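
With its default settings, ttest_ind performs a two-sample t-test that assumes both groups share the same variance. The sketch below shows how that t-statistic can be computed by hand with the standard library; `pooled_t_statistic` is a hypothetical helper and the two groups are made-up numbers, not the survivor ages.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t-statistic assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error

# Made-up values, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat = pooled_t_statistic(group_a, group_b)
print(round(t_stat, 4))
```

The sign of the statistic only reflects which group is passed first; the p-value comes from comparing its magnitude against a t-distribution with n_a + n_b - 2 degrees of freedom.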

COMPENSATION: The age distribution plot of survivors below also uses the compensated data, which removes a spike of survivors in the 25-30 age range.


In [13]:
plt.figure(figsize=(10, 4))
opacity = 0.5

plt.hist(survivors_male.Age, bins=np.arange(0, 90, 5), alpha=opacity, label="Males")
plt.hist(survivors_female.Age, bins=np.arange(0, 90, 5), alpha=opacity, label="Females")
plt.legend()
plt.title("Age Distribution of Female and Male Survivors")
plt.xlabel("Ages")
plt.ylabel("Number of Survivors")
plt.show()


Question 7

Is there a statistically significant difference between the fares paid by passengers from Queenstown and by passengers from Cherbourg?


In [14]:
from scipy.stats import ttest_ind
fare_from_q = titanic_data[titanic_data.Embarked == 'Q']
fare_from_c = titanic_data[titanic_data.Embarked == 'C']
t_stat, p_value = ttest_ind(fare_from_q.Fare, fare_from_c.Fare)

print("Results:\n\tt-statistic: %.5f\n\tp-value: %g" % (t_stat, p_value))


Results:
	t-statistic: -4.84439
	p-value: 2.26359e-06

There is a statistically significant difference between the fares paid by passengers from Queenstown and those paid by passengers from Cherbourg. This is indicated by the p-value (about 2.3e-06), which is less than 0.01.
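
Fare samples from different ports can have very different spreads and sizes. When that is a concern, ttest_ind accepts equal_var=False, which runs Welch's t-test instead of the pooled-variance test. The sketch below compares the two on made-up fares for two hypothetical ports; it is an illustration, not a re-analysis of train.csv.

```python
from scipy.stats import ttest_ind

# Made-up fares for two hypothetical ports; not taken from train.csv.
fares_low_spread = [7.0, 7.5, 7.75, 8.0, 8.05]
fares_high_spread = [10.0, 27.0, 30.0, 45.0, 60.0, 80.0, 90.0, 150.0]

# Default: Student's t-test with a pooled variance (equal_var=True).
t_pooled, p_pooled = ttest_ind(fares_low_spread, fares_high_spread)

# Welch's t-test: does not assume equal variances.
t_welch, p_welch = ttest_ind(fares_low_spread, fares_high_spread, equal_var=False)

print("pooled: t=%.3f p=%.4f" % (t_pooled, p_pooled))
print("welch:  t=%.3f p=%.4f" % (t_welch, p_welch))
```

The two variants give the same t-statistic only when the group sizes or the sample variances are equal; otherwise both the statistic and the degrees of freedom differ.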


In [15]:
plt.figure(figsize=(10, 4))
opacity = 0.5

# Bin edges run from 0 to 85; fares above 85 fall outside these bins and are not drawn
plt.hist(fare_from_q.Fare, bins=np.arange(0, 90, 5), alpha=opacity, label="Queenstown")
plt.hist(fare_from_c.Fare, bins=np.arange(0, 90, 5), alpha=opacity, label="Cherbourg")
plt.legend()
plt.title("Fare Distribution for Queenstown and Cherbourg Passengers")
plt.xlabel("Fare Price")
plt.ylabel("Number of Passengers")
plt.show()